feat: implemented sampling for MTP #1
Conversation
|
Hi, thanks for this! I think this is similar to, though probably a cleaner version than, what I had before I changed the MTP sampler to a simple argmax. I'll tell you what I think; let me know to what extent this does/does not agree with your understanding:
|
Hi @F1LM1, thanks for the feedback! My bad for missing your earlier commits with a proper sampling implementation; based on the discussion, I thought it was still on your to-do list. After digging into the code, I think I have a clearer picture.
Regarding common_sampler_accept: I see my mistake. In standard speculative mode with a separate draft model (ctx_dft), the accept call inside the draft function maintains the state of the draft context for sequential token generation. But for MTP, since we share a single context and sampler, calling it prematurely would indeed pollute the main sampler's state before verification.
Regarding the "Modify Logits + Greedy" strategy: I'm already drafting an idea for that, which involves filling the sampler's candidate array with the MTP logits, applying the sampling chain, and then taking the first candidate. This is my concept:
// In common/speculative.cpp
llama_token mtp_speculative_gen_draft(
        struct common_sampler * smpl,
        struct llama_context  * ctx,
        llama_token             id_last,
        int32_t                 n_past,
        int32_t                 last_tok_idx) {
    if (!smpl) {
        return -1;
    }

    // single-token batch for the MTP layer
    llama_batch batch = llama_batch_init(1, 0, 1);
    common_batch_add(batch, id_last, n_past, { 0 }, true);
    llama_build_and_execute_mtp_graph(ctx, batch, id_last, n_past, last_tok_idx);

    const llama_model * model = llama_get_model(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);
    const int n_vocab = llama_n_vocab(vocab);

    // fill the sampler's candidate array with the MTP logits
    llama_token_data_array * cur_p = common_sampler_get_candidates(smpl);
    const float * logits = llama_get_logits_ith(ctx, last_tok_idx);

    cur_p->size = n_vocab;
    for (int i = 0; i < n_vocab; ++i) {
        cur_p->data[i].id    = i;
        cur_p->data[i].logit = logits[i];
    }
    cur_p->sorted = false;

    // apply the full sampling chain (penalties, DRY, etc.), then take the top candidate
    common_sampler_apply_chain(smpl, cur_p);
    const llama_token id = cur_p->data[0].id;
    ...
// In common/sampling.cpp
void common_sampler_apply_chain(struct common_sampler * gsmpl, struct llama_token_data_array * cur_p) {
    llama_sampler_apply(gsmpl->chain, cur_p);
}

I'll test this out a bit, but in the meantime, I'm open to feedback. I also have another question: I'm looking at how to implement the
|
This seems reasonable to me; with luck it will show up as better acceptance rates when rep penalties are turned on :)
Haven't really started thinking about this. When I have some free time I plan to focus on seeing if we can do some basic optimizations like graph reuse and such. You're definitely welcome to work on this!
Frankly I don't know exactly what the multi-head case for MTP would look like, but my impression is that you cannot MTP-draft N tokens simply by autoregressively predicting with a single MTP head the way you can with a typical draft model. Rather, I believe the number of MTP heads is a fixed feature of the model/weights, so if you wanted to draft say N = 5 tokens at once, the model would have to have at least N = 5 MTP layers/heads that all produce outputs in a single forward pass of the full model (including MTP). I would've guessed that each MTP layer takes as input the previous layer's output embedding and its sampled token (concatenated into an input embedding the way we do for the single MTP head here), rather than having to run the whole thing again autoregressively. But if you find material to the contrary, I would absolutely love to see it.
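To make that concrete, a rough conceptual sketch of how N heads would chain inside one pass (this is not existing llama.cpp API; mtp_multi_head_draft and head_forward are hypothetical placeholders):

#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the real type from llama.h

// Conceptual only: with N MTP heads, each head consumes the previous head's output
// embedding plus the token drafted from it, so all N draft tokens fall out of a
// single forward pass instead of N autoregressive decodes.
static std::vector<llama_token> mtp_multi_head_draft(
        int n_heads,
        std::vector<float> embd,   // output embedding of the main model at the last position
        llama_token tok,           // last accepted token
        // hypothetical per-head forward: (head index, embedding, token) -> (new embedding, drafted token)
        const std::function<std::pair<std::vector<float>, llama_token>(
            int, const std::vector<float> &, llama_token)> & head_forward) {
    std::vector<llama_token> draft;
    draft.reserve(n_heads);
    for (int h = 0; h < n_heads; ++h) {
        auto out = head_forward(h, embd, tok);
        embd = std::move(out.first);
        tok  = out.second;
        draft.push_back(tok);
    }
    return draft;
}
|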
Hey! I tried with 52 requests, ranging from 6k to 38k tokens of context, and got an acceptance rate of ~0.5931 +/- 0.041, compared with the ~0.51 I reported before. This was with the same settings (temp=1.0, DRY enabled) for writing. The latest commit includes these changes.
I was unable to find proper documentation or even much discussion, only suggestions to look at SGLang and vLLM, so I looked at how vLLM implements it. Their approach is as follows:
self.num_mtp_layers = config.num_nextn_predict_layers
self.layers = torch.nn.ModuleDict({
    str(idx): Glm4MoeMultiTokenPredictorLayer(...)
    for idx in range(self.mtp_start_layer_idx,
                     self.mtp_start_layer_idx + self.num_mtp_layers)
})
...
def forward(..., spec_step_idx: int = 0):
    ...
    current_step_idx = (spec_step_idx % self.num_mtp_layers)
    return self.layers[str(self.mtp_start_layer_idx + current_step_idx)](...)

So if we pass, for example, spec_step_idx = 0, 1, 2, ..., the layer index wraps modulo num_mtp_layers: with a single MTP layer the same layer is reused for every draft step, and with more layers it would alternate between them (see the small sketch below). My proposed plan now shifts to two steps:
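As a side note on the selection itself, the llama.cpp equivalent would just be a modular index into the NextN layers; a minimal sketch, where n_layer_main and n_nextn_layers are placeholder names rather than actual hparams fields:

#include <cstdint>

// Placeholder names, not actual llama.cpp hparams fields:
//   n_layer_main   - number of regular transformer layers (the NextN/MTP layers sit after them)
//   n_nextn_layers - how many MTP heads the checkpoint ships (1 for GLM-4.5 today)
//   spec_step_idx  - which draft step we are on (0, 1, 2, ...)
static int32_t mtp_layer_for_step(int32_t n_layer_main, int32_t n_nextn_layers, int32_t spec_step_idx) {
    const int32_t step = spec_step_idx % n_nextn_layers; // wraps, so a single head simply gets reused each step
    return n_layer_main + step;                          // index of the MTP layer to run for this draft step
}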
|
Great, I've been away the last couple of days, but I'll give this a spin as well; sounds promising!
If I'm reading this correctly, it looks like if num_mtp_layers = 1 then it will run the one MTP layer autoregressively, but if num_mtp_layers = 2, for example, it will alternate between the layers? That seems... odd, but I agree it can't hurt to match their implementation until we have an example of a model with num_mtp_layers > 1 to see if it works. Hopefully we'll see decent draft acceptance at least for the "easy" cases (coding), and even if not, it's easy enough to just recommend choosing the N that ends up working best. |
Yes, the alternating layer logic seems odd. I felt the same way, especially since we've only seen models with a single MTP head, and the previous layers don't have the
# in vllm/spec_decode/eagle.py
class EagleProposer:
    ...
    def propose(self, ...):
        ...
        # Generate the remaining draft tokens.
        draft_token_ids_list = [draft_token_ids]
        for _ in range(self.num_speculative_tokens - 1):
            # The input for this iteration is the token generated in the previous one.
            input_ids = draft_token_ids_list[-1].int()
            # Runs the model for a single step
            last_hidden_states, hidden_states = self.model(...)
            # Calculates logits and samples the next token (with argmax)
            logits = self.model.compute_logits(last_hidden_states[:batch_size], None)
            draft_token_ids = logits.argmax(dim=-1)
            # Appends the new token for the next iteration
            draft_token_ids_list.append(draft_token_ids)

This is essentially what our
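Concretely, the analogous loop on our side could look roughly like this; just a sketch, reusing the mtp_speculative_gen_draft signature from earlier in this thread and glossing over how last_tok_idx and the MTP KV cache should evolve between steps:

#include <vector>
#include "speculative.h" // this PR's common/speculative.h, for mtp_speculative_gen_draft

static std::vector<llama_token> mtp_draft_n(
        struct common_sampler * smpl,
        struct llama_context  * ctx,
        llama_token             id_last,
        int32_t                 n_past,
        int32_t                 last_tok_idx,
        int32_t                 n_draft) {
    std::vector<llama_token> draft;
    for (int32_t i = 0; i < n_draft; ++i) {
        // feed the previously drafted token back in, like vLLM's propose() loop
        const llama_token id = mtp_speculative_gen_draft(smpl, ctx, id_last, n_past, last_tok_idx);
        if (id < 0) {
            break;
        }
        draft.push_back(id);
        id_last = id;
        n_past += 1; // simplification: assumes the MTP KV cache advances by one position per step
    }
    return draft;
}

Whether the single GLM-4.5 head produces useful tokens beyond the first draft position is a separate question, but structurally that is all the recursion amounts to. |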
|
As I commented on the other PR, I suspect that supporting KV cache for multi-token MTP drafts is going to be a significant step up in complexity, while for the one-token case we can piggyback on the existing KV cache system (since it sets aside cache for the single MTP layer already). I'll get a chance tomorrow to spin up this PR, but I think this should represent the optimal sampling subroutine. If you're eager to finish it off, maybe start thinking about how we can make the setup more efficient by reusing stuff where possible (memory ctx? graphs? >1 size batches?), since we're basically recreating a bunch of stuff from scratch for every token. |
Okay, I'll take a look at what you suggested and find ways to store the state for the context and graph. Regarding the batch size part, if I understand correctly, you're referring to fixing the alternation between draft and main model tokens in the server's main loop. I agree that would be a great optimization, but it seems like it would take a while, change a lot of the server logic, and require extensive testing. I feel like that would be a good follow-up PR. It's more of a general feature to improve not only MTP but drafting in general, and giving it a separate PR would allow us to merge the core MTP implementation first. |
Nah, I meant some form of batching when we do the MTP layer prompt processing step, since we're likely going to process hundreds of tokens at once using the same graphs/memory context/etc. Right now we're building only size-1 batches, which just feels wrong. AFAIK, the alternation thing might be deceptively easy to fix: I suspect it could be as simple as making sure we only do the non-speculative llama_decode step exactly once, i.e. immediately after prompt processing. I'll need to find my notes on this, but I'm pretty sure everything else is always correctly synced, at least for the MTP case.
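To sketch the batching point (hypothetical: only the single-token llama_build_and_execute_mtp_graph exists in this PR, so the batched entry point below is a placeholder), the MTP warmup over the prompt could be one multi-token batch instead of N size-1 batches:

#include <vector>
#include "common.h" // common_batch_add
#include "llama.h"

// Hypothetical batched MTP warmup: populate the MTP layer's KV cache for the whole
// prompt in one pass, instead of building a size-1 batch per token.
static void mtp_warmup_prompt(struct llama_context * ctx,
                              const std::vector<llama_token> & prompt,
                              llama_pos n_past_start) {
    llama_batch batch = llama_batch_init((int32_t) prompt.size(), 0, 1);
    for (size_t i = 0; i < prompt.size(); ++i) {
        // no logits needed: this pass only warms the MTP KV cache
        common_batch_add(batch, prompt[i], n_past_start + (llama_pos) i, { 0 }, false);
    }
    // llama_build_and_execute_mtp_graph_batched(ctx, batch); // hypothetical batched entry point
    llama_batch_free(batch);
}
|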
Ah, thanks for clarifying the batching part! I was looking at the main generation loop, but batching the MTP prompt processing makes sense. I'll keep that on my list of things to look into.
That's a great point about the alternation fix potentially being simple; I'm looking forward to your notes on that when you find them.
In the meantime, I've been focused on your first suggestion: reusing resources during the single-token draft generation to avoid recreating everything from scratch. I've been experimenting with a
My latest attempt was to build the graph once inside
My diagnosis is that the
It feels like a chicken-and-egg problem: to properly reuse the graph, we'd need to persist the scheduler, but to persist the scheduler, we'd need to persist the memory context, which is the core of the KV cache problem. I'm probably missing something due to my limited knowledge of the
Any pointers would be a huge help. |
|
I finally got a chance to test the improved sampler. It works well in my testing, raising draft acceptance rate in some "hard" writing scenarios by more than 10 percentage points on average, which is a clear and large gain. Ironically it ends up being slower in actual tok/s generation, presumably because the sampling chain as-is is inefficient, but let's see what we can do about that in follow-ups. Re: the graph reuse questions you mentioned above, I'll fire up the project again this weekend and see what I find. It's been a while since I dove in. |
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e Author: samuel <[email protected]> Date: Sun Dec 7 23:00:29 2025 -0300 speculative (feat): implement recursive MTP drafting for GLM-4.5 commit bdf72d9 Author: samuel <[email protected]> Date: Sat Dec 6 16:10:16 2025 -0300 sampling (feat): optimize speculative drafting with fast-path selection commit a91980a Author: samuel <[email protected]> Date: Sat Dec 6 15:18:19 2025 -0300 mtp (chore): clean old code commit 6de0ecf Author: samuel <[email protected]> Date: Sat Dec 6 14:40:13 2025 -0300 mtp (feat): add mtp arg commit ea77394 Author: samuel <[email protected]> Date: Sat Dec 6 13:47:54 2025 -0300 mtp-graph (fix): move llama_get_logits_ith outside the loop commit 15dff20 Merge: 171346c cae85fe Author: samuel <[email protected]> Date: Thu Oct 16 13:44:41 2025 -0300 Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache commit cae85fe Author: samuel <[email protected]> Date: Thu Oct 16 13:42:31 2025 -0300 mtp-batch(fix): avoid logits for mtp kv cache operations commit 171346c Author: samuel <[email protected]> Date: Sun Oct 12 16:33:01 2025 -0300 mtp-graph(feat): Reactivate graph reuse only for main model path commit 0127c6b Author: samuel <[email protected]> Date: Sat Oct 11 22:20:54 2025 -0300 mtp-batch(chore): Remove final MTP debug logs and dead code commit 4bcc9e2 Author: samuel <[email protected]> Date: Sat Oct 11 18:51:22 2025 -0300 mtp-batch(fix): Correctly advance cache head and add MTP documentation commit b4cbe03 Author: samuel <[email protected]> Date: Sat Oct 11 18:37:40 2025 -0300 mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs commit a99709d Author: samuel <[email protected]> Date: Fri Oct 10 17:24:34 2025 -0300 mtp-batch(refactor): Extract decode context and MTP input logic into helper methods commit 913af8f Author: samuel <[email protected]> Date: Fri Oct 10 16:44:28 2025 -0300 mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum commit 6f74ba3 Author: samuel <[email protected]> Date: Thu Oct 9 22:27:18 2025 -0300 mtp-batch (fix): prevent mtp draft from polluting the cache commit 5e1d719 Author: samuel <[email protected]> Date: Thu Oct 9 15:21:23 2025 -0300 mtp-batch (feat): Create and manage sinfo for MTP commit febd823 Author: samuel <[email protected]> Date: Sun Oct 5 14:43:40 2025 -0300 mtp-batch (wip): fix how to warmup kv cache for MTP commit 67c6c06 Author: samuel <[email protected]> Date: Sat Sep 27 19:42:32 2025 -0300 mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption commit 75dc25e Author: samuel <[email protected]> Date: Sat Sep 27 17:17:00 2025 -0300 mtp-batch (wip): organize batch for mtp cache commit 3da7e7f Author: samuel <[email protected]> Date: Tue Sep 23 22:45:11 2025 -0300 mtp-batch (fix): warm mtp cache for small batch size commit df64508 Author: samuel <[email protected]> Date: Sun Sep 21 21:55:41 2025 -0300 mtp-batch (wip): merge glm graphs commit 042eb8a Author: samuel <[email protected]> Date: Sun Sep 21 21:29:00 2025 -0300 mtp-batch (wip): merge mtp and model graph commit 1318b2d Author: samuel <[email protected]> Date: Sun Sep 14 10:22:59 2025 -0300 mtp-batch (wip): move mtp execution to batch format commit c6237c7 Merge: 9fab53e 8742ce0 Author: Aaron Lee <[email protected]> Date: Sat Sep 13 02:57:01 2025 -0400 Merge pull request F1LM1#1 from SamuelOliveirads/glm4-moe-mtp feat: implemented sampling for MTP commit 8742ce0 Author: samuel <[email 
protected]> Date: Sat Sep 6 00:21:18 2025 -0300 feat: apply logits + greedy sampler commit 5a5bce8 Author: samuel <[email protected]> Date: Wed Sep 3 17:56:14 2025 -0300 fix: add sample acceptance commit 07670a2 Author: samuel <[email protected]> Date: Wed Sep 3 13:25:21 2025 -0300 feat: implemented sampling for MTP commit 9fab53e Author: Aaron Lee <[email protected]> Date: Tue Sep 2 17:14:09 2025 -0400 fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch commit 98bc0c6 Author: Aaron Lee <[email protected]> Date: Tue Aug 26 01:26:51 2025 -0400 replace standard sampler with greedy sampler for mtp draft commit 471e026 Author: Aaron Lee <[email protected]> Date: Tue Aug 19 23:10:56 2025 -0400 fixed vram leak commit d72f9d5 Author: Aaron Lee <[email protected]> Date: Tue Aug 19 01:50:34 2025 -0400 kludge-y kv cache management of mtp layer commit 382135a Author: Aaron Lee <[email protected]> Date: Sun Aug 17 21:54:45 2025 -0400 fixed mtp kv cache update sequencing after prompt processing commit 6870f97 Author: Aaron Lee <[email protected]> Date: Sun Aug 17 04:59:36 2025 -0400 added proper KV cache management for MTP layers and slightly refactored commit 6e9bafc Author: Aaron Lee <[email protected]> Date: Fri Aug 15 23:13:56 2025 -0400 failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable commit cf0f7c0 Author: Aaron Lee <[email protected]> Date: Wed Aug 13 02:21:17 2025 -0400 broad thrust of the mtp implementation commit 03231da Author: Aaron Lee <[email protected]> Date: Tue Aug 12 01:03:59 2025 -0400 add model member function to build mtp graph, to be called from speculative.cpp commit 1f477b3 Author: Aaron Lee <[email protected]> Date: Mon Aug 11 20:54:45 2025 -0400 make nextn weights loadable without a crash commit e434f87 Author: Aaron Lee <[email protected]> Date: Mon Aug 11 01:21:47 2025 -0400 some work towards building mtp layer graph commit db60623 Author: Aaron Lee <[email protected]> Date: Sun Aug 10 23:52:54 2025 -0400 added getter for nextn layer count and server slot has_mtp property
ggml-org#958) * port upstream ggml-org#16932 * Add fixed chat templates. * fix grammar when tool have no argument * Insert additional stops for Kimi-K2 * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6 * update chat.cpp * fix grammar for GLM 4.5/4.6 * chat: Fix streaming parser for granite models (ggml-org#15682) * fix(chat): fix streaming parser for granite models * tests: add test cases for Granite models chat parser * common : Fix corrupted memory error on json grammar initialization (ggml-org#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 F1LM1#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 F1LM1#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, 
nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 F1LM1#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 F1LM1#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 F1LM1#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 F1LM1#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 F1LM1#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, 
templates_params const&, bool) chat.cpp:1110 F1LM1#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992 ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074 ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) F1LM1#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) F1LM1#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) F1LM1#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) F1LM1#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) F1LM1#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... 
==45482==ABORTING * common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264) * common : fix reasoning before forced tool call via tool_choice = required * common : improve reasoning and commentary handling when tool_choice is required (cherry picked from commit c746984) --------- Co-authored-by: Alde Rojas <[email protected]> * Try fix Jinja template for GLM * Improve Kimi-K2 chat template * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: shun095 <[email protected]> Co-authored-by: David Ribeiro Alves <[email protected]> Co-authored-by: crat0z <[email protected]> Co-authored-by: Alde Rojas <[email protected]>
Hi @F1LM1, I've been following your PR and decided to tackle one of the to-dos you mentioned: implementing proper sampling for the MTP draft model.
I've successfully implemented a solution that retrieves the full logits from the MTP and passes them to the sampler for the draft token generation. The code seems stable and is ready for your review.
Here are the key results from my testing:
Interestingly, when compiled in release mode, I achieved an average acceptance rate of 0.51 for creative tasks, up from the ~0.4 you mentioned in your PR.
I tried to preserve your code, and I'm open to any suggestions for improvement.